Segmentation in weakly labeled videos via a semantic ranking and optical warping network

Abstract

Weakly supervised video object segmentation (WSVOS) focuses on generating pixel-level object masks for videos only tagged with class labels, which is an essential yet challenging task. For WSVOS, the algorithm is just aware of rough category information rather than the concrete object size and location cues, besides it lacks reliable annotated exemplars to learn temporal evolution in the investigated videos.

Challenges

Basically, there are three challenging factors which may influence the performance of WSVOS:

1. Foreground Object Discovery: Discovering foreground objects in each frame without detailed annotations.

2. Coarse Object Semantic Consistency: Maintaining semantic consistency within each video across different frames.

3. Fine-grained Segmentation Smoothness: Ensuring smooth segmentation boundaries within neighbor frames.

Methodology

In this paper, we establish a semantic ranking and optical warping network to simultaneously solve these three challenges in a unified framework.

For Challenge 1: We apply the still image saliency detection method and discover the foreground object for each frame via a segmentation network.

Due to the huge discrepancies between the image saliency and the video object segmentation, we step further and propose two subnetworks to solve the other two challenges:

For Challenge 2: We propose an attentive semantic ranking subnetwork to mine video-level tags, which can learn discriminative features for semantic ranking and lead to semantic consistent segmentation masks.

For Challenge 3: We propose an optical flow warping subnetwork to constrain fine-grained segmentation smoothness within neighbor frames, which can suppress the large deformation and thus obtain smooth object boundaries for adjacent frames.

Experimental Results

Experiments on two benchmark data sets, i.e., DAVIS dataset and YouTube-Objects dataset, demonstrate the effectiveness of the proposed approach for segmenting out video objects under weak supervision.

The results show that by combining semantic ranking with optical warping, our method can effectively handle the challenges in weakly supervised video object segmentation and produce high-quality segmentation masks that are both semantically consistent and temporally smooth.

Keywords: Deep Learning Computer vision Video semantic ranking neural network Segmentation Object Segmentation Optical warping weakly supervised Instance Segmentation

Segmentation in weakly labeled videos via a semantic ranking and optical warping network

Abstract

Challenges

Methodology

Experimental Results

📚 Cite This Work